Exons, introns and DNA thermodynamics 
The genes of eukaryotes are characterized by protein-coding fragments, the exons, interrupted 
by introns, i.e. stretches of DNA which do not carry useful information for protein synthesis. We 
have analyzed the melting behavior of randomly selected human cDNA sequences obtained from 
genomic DNA by removing all introns. A clear correspondence is observed between exons and 
melting domains. This finding may provide new insights into the physical mechanisms underlying 
the evolution of genes. 
One of the most striking aspects of the human genome
is the presence of long stretches of DNA with no apparent
(or known) significance. This is what biologists refer
to as junk DNA, and it comprises the majority of our
DNA. In the human genome (and that of other higher
eukaryotes) not only are the genes very sparse, but most
of them are interrupted by sequences, the introns, which
are non-coding, i.e. they do not carry information for
protein synthesis [1]. Introns are therefore removed from
the messenger RNA (mRNA), which is assembled only from
the expressed parts of the gene, the exons. In the human
genome introns are on average ten times longer than exons
and thus constitute the majority of the gene. Prokaryotes
(such as bacteria) instead have a very compact genome
without introns.

The discovery of introns in 1977 triggered a debate
around their significance and origin, which led to the
formulation of the "introns-early" and the "introns-late"
theories [1]. According to the "introns-early"
viewpoint the introns appeared at the origin of life and
the exons were small ancient genes. The bacteria then
lost the introns due to selective pressure to keep
their genome short. The "introns-late" theory instead
claims that introns must have appeared much later, i.e.
during early eukaryotic evolution. In recent years a
consensus between these opposing views has been reached.
The analysis of an increasing number of genes showed
that most of the introns have a "recent" origin, although
a few are still believed to be very old.
The mechanism by which introns were included into the
genome is, however, still poorly understood (for a recent
discussion, see e.g. Ref. [9]).

In this paper we present the results of a study of
the physical properties of human DNA sequences which
points to a possible pathway leading to intron insertion
in genes. By means of a statistical mechanics approach
we analyze the thermodynamic stability ("melting") of
DNA sequences obtained by assembling the exons together.
Such a sequence is known as complementary DNA (cDNA)
and can be obtained in the laboratory by reverse
transcription of mRNA. As illustrated in Fig. 1, cDNA is
characterized by exon-exon boundaries and by the boundaries
between the coding sequence (CDS) and the untranslated
regions (UTR). If introns were inserted "recently" into
the genome, then the cDNA roughly resembles an ancient
gene, apart from the mutations which occurred since the
insertion of the first introns (see below). We find that
exon-exon boundaries in cDNA sequences are strongly
correlated with the boundaries of their melting domains.

FIG. 1: Schematic view of genomic DNA and cDNA. Exons are the
segments of the gene transcribed into mRNA, while introns
are spliced out. The cDNA, which is obtained by reverse
transcription of the single-stranded mRNA, contains no introns.
It is characterized by exon-exon boundaries (vertical solid lines)
and boundaries between the protein coding sequence (CDS)
and untranslated regions (UTR) (vertical dashed lines). The
part of the exons shown in black contains the protein coding
sequence. The 3' and 5' ends refer to those of the
single-stranded mRNA.

FIG. 2: Differential melting curve and melting domains for
the human β-actin cDNA (NCBI entry code: NM_001101).
Horizontal axis: temperature; vertical axes: $-N\,d\theta/dT$
and position along the sequence. Vertical bars in the graph
indicate the regions along the chain for which $\theta_i < 1/2$.
Horizontal solid lines are exon-exon boundaries and dashed
lines are boundaries between the protein coding part of the
sequence (CDS) and the untranslated regions (UTR). A remarkable
overlap between the genomic and thermodynamic domains is observed.

DNA melting is the process by which the double-stranded
molecule in solution dissociates into two separate
strands upon an increase of temperature [10]. Fragments
which are longer than 1000 base pairs dissociate
through a multistep process in which different parts of
the chain melt at different temperatures. These "melting
domains" are typically a few hundred nucleotides
long. The thermodynamics of the DNA melting process
has been investigated both experimentally [11] and by
means of numerical calculations based on the statistical
mechanics of the dissociation process. The latter
approach allows one to calculate $\theta_i$, the probability that
the i-th base pair is bound at a temperature T. The total
fraction of bound base pairs is then $\theta = \sum_i \theta_i / N$, where
N is the number of nucleotide pairs in the molecule. The
multistep nature of the melting transition can be seen in
a plot of $\theta$ vs. T. This quantity decreases to zero with
increasing temperature through a series of jumps, which
correspond to sharp peaks in a plot of $-d\theta/dT$ versus T,
the differential melting curve. The parameter $\theta$ can also
be measured by UV absorption of DNA in solution.
Typically statistical mechanics programs reproduce the
experimental results quite well [13].
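
As a minimal sketch of the definitions above, the following Python snippet computes $\theta(T)$ and the differential melting curve $-d\theta/dT$ from a table of per-base-pair probabilities $\theta_i(T)$. The function name `melting_curve`, the array layout and the toy two-domain input are assumptions made for illustration only; the probabilities themselves would have to come from a statistical mechanics melting program, which is not reproduced here.

```python
import numpy as np

def melting_curve(theta_i, temperatures):
    """Return theta(T) and -d(theta)/dT from per-base-pair probabilities.

    theta_i has shape (n_temperatures, n_basepairs); theta_i[t, i] is the
    probability that base pair i is bound at temperatures[t].
    """
    theta_i = np.asarray(theta_i, dtype=float)
    temperatures = np.asarray(temperatures, dtype=float)
    theta = theta_i.mean(axis=1)                  # theta = (1/N) * sum_i theta_i
    dtheta_dT = np.gradient(theta, temperatures)  # finite-difference derivative
    return theta, -dtheta_dT

# Toy input (an assumption, not real data): 300 base pairs forming two
# domains that melt at 86 and 80 degrees C, each with a sigmoidal profile.
T = np.linspace(70.0, 95.0, 501)
tm = np.where(np.arange(300) < 200, 86.0, 80.0)
toy_theta_i = 1.0 / (1.0 + np.exp((T[:, None] - tm[None, :]) / 0.3))

theta, diff_curve = melting_curve(toy_theta_i, T)
print("highest peak of the differential melting curve at T =",
      T[np.argmax(diff_curve)])
```

In this toy case the differential melting curve has one peak per domain, which mimics the multistep melting described in the text.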

FIG. 3: As in Fig. 2 but for the human CDK4 cDNA (NCBI
entry NM_000075).

FIG. 4: As in Fig. 2 but for the human EHHADH gene
(NCBI entry NM_001966).

In Fig. 2 we plot the melting curve $-N\,d\theta/dT$, as obtained
by a statistical mechanical calculation [13], for
the human β-actin cDNA, where N is the total length
of the sequence (N = 1792 in this case). We used the
same stacking energies and loop entropic parameters as
in Ref. [14]. The salt concentration was fixed at 0.05 M.
The three main melting peaks of Fig. 2 indicate three
sharp subtransitions which characterize the melting of
the sequence. The evolution of the average configuration
of the sequence as a function of T can be read off from
the vertical bars, which denote, at the given temperatures,
the regions which are more likely to be dissociated,
i.e. where $\theta_i < 1/2$. For instance, the bar shown
at T ≈ 85°C indicates that the region with i > 850 is
dissociated, while that with i < 850 is in a helical state.
These melting domains are plotted for temperatures between
melting peaks, so that, by comparing the configurations
at temperatures below and above each peak, one
can visualize the regions of the sequence involved in the
multistep melting. We refer to the nucleotides separating
two neighboring regions of the sequence with $\theta_i < 1/2$
and $\theta_i > 1/2$ as the thermodynamic boundaries. In Fig. 2
exon-exon boundaries are indicated as horizontal solid
lines, while the boundaries between the CDS and UTRs
are shown as dashed lines. The numbers on the left vertical
axis, located at the exon-exon boundaries, show the
length of the introns in the genomic DNA.
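
The definition of a thermodynamic boundary given above can be sketched in a few lines of Python. The function name `thermodynamic_boundaries` and the six-value profile are hypothetical and serve only to illustrate the criterion: a boundary sits wherever two neighboring base pairs lie on opposite sides of $\theta_i = 1/2$ at a given temperature.

```python
import numpy as np

def thermodynamic_boundaries(theta_i_at_T):
    """Return the indices i such that base pairs i and i+1 lie on
    opposite sides of theta = 1/2 at one fixed temperature."""
    open_state = np.asarray(theta_i_at_T, dtype=float) < 0.5  # True = dissociated
    return np.flatnonzero(open_state[:-1] != open_state[1:])

# Hypothetical profile: the first three base pairs are helical,
# the last three are dissociated.
profile = np.array([0.90, 0.95, 0.80, 0.30, 0.10, 0.05])
boundaries = thermodynamic_boundaries(profile)
print(boundaries)
```

Here a boundary is reported at the index of the last base pair before the change of state; other conventions (for example the midpoint) would work equally well.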

In the example of Fig. 2 the melting process starts
with the opening of small loops in the 3'UTR, while the
first sharp peak at 80°C is the dissociation of the whole
3'UTR region. The next peak at about 84°C is due to
the melting of exons 5 and 6 (numbering them from
the 5' region), while the melting of exons 3 and 4
occurs at a higher temperature (≈ 86°C). A remarkable
overlap between the locations of the thermodynamic and
genomic domains is observed.

An equally striking correspondence is found in most
of the human cDNA sequences we investigated. Figure 3
shows the melting curve for the cDNA of the
cyclin-dependent kinase 4 (CDK4). Occasionally, we found some
discrepancies, as can be seen e.g. in Fig. 4, which shows
the cDNA for the human EHHADH gene. The inset shows
an enlargement of the region for the temperature interval
of 80-85°C. Note that the thermodynamic boundary
indicated by the arrows splits the third exon of the
sequence into two roughly equal parts.

FIG. 5: Histogram of the thermodynamic boundaries (thick
line) of the human cDNA encoding the interleukin enhancer
binding factor 2 (ILF2) protein, GenBank entry
NM_004515. The thin vertical lines denote exon-exon (solid)
and CDS-UTR (dashed) boundaries. The arrows indicate the
exon-exon boundaries detected by the melting analysis.

Long cDNA sequences (> 3000 bp) may have a very
complex melting curve with many overlapping peaks. In
order to have a better criterion for the definition of
thermodynamic boundaries we have performed a temperature
scan from 60 to 100°C and calculated the boundaries
separating the $\theta_i < 1/2$ from the $\theta_i > 1/2$ regions at
a fixed interval $\Delta T = 0.01$°C. We have then derived a
histogram $h_i$ over all base pairs $i$ as follows: if $i$ is found
to be a thermodynamic boundary between two temperatures
$T_1$ and $T_2 > T_1$, the contribution to the histogram
is $h_i = (T_2 - T_1)/\Delta T$.
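
The histogram construction can be sketched as follows, under the assumption that the probabilities $\theta_i(T)$ are already available on a temperature grid with spacing $\Delta T$. Counting, for each position, the number of scan temperatures at which it is a boundary reproduces $h_i = (T_2 - T_1)/\Delta T$ up to the counting convention at the interval end points. The function name `boundary_histogram` and the toy input are illustrative assumptions, not the authors' code.

```python
import numpy as np

def boundary_histogram(theta_i):
    """Count, for each position i, the scan temperatures at which base
    pairs i and i+1 lie on opposite sides of theta = 1/2.

    theta_i has shape (n_temperatures, n_basepairs), one row per scan
    temperature, with rows spaced by a fixed interval Delta T.
    """
    open_state = np.asarray(theta_i, dtype=float) < 0.5
    is_boundary = open_state[:, :-1] != open_state[:, 1:]
    return is_boundary.sum(axis=0)

# Toy scan (an assumption, not real data): Delta T = 0.01 and two domains
# with melting temperatures 86.005 and 80.005 degrees C.
T = np.linspace(70.0, 95.0, 2501)
tm = np.where(np.arange(300) < 200, 86.005, 80.005)
toy_theta_i = 1.0 / (1.0 + np.exp((T[:, None] - tm[None, :]) / 0.3))

h = boundary_histogram(toy_theta_i)
print("strongest boundary at position", h.argmax(), "with height", h.max())
```

In the toy example the single boundary between the two domains persists over a 6 degree interval, so its height is 6/0.01 = 600, and taller peaks correspond to boundaries that are stable over wider temperature ranges.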

Figure 5 shows the histogram $h_i$ as a function of $i$ for the
human ILF2 cDNA (thick line). The solid and dashed
thin vertical lines denote the exon-exon and CDS-UTR
boundaries, respectively. The histogram is characterized
by a few main thermodynamic boundaries, well above the
noise level (as typically observed in all cases). The
advantage of $h_i$ with respect to a plot of the differential
melting curves is that boundaries appear more clearly
also for long sequences, and their stabilities can be
quantified from the height of $h_i$. The ILF2 cDNA melting of
Fig. 5 is yet another example of the good correspondence
between thermodynamics and genomic features.
The exon-exon boundaries "detected" by thermodynamics
are shown as vertical arrows in Fig. 5.

The correspondence between melting domains and genomic
features has also been explored recently in studies
of lower eukaryotes such as S. cerevisiae [15], P. falciparum
or D. discoideum. These studies focused on genomic
DNA and in particular on the correspondence between
exon-intron and thermodynamic boundaries. This
correspondence allowed Yeramian et al. to locate genes
in the P. falciparum [18] and D. melanogaster [19]
genomes. Exon-intron boundaries may be difficult to
detect in higher eukaryotes by melting analysis as introns
tend to be very long. We have illustrated this in Fig. 6,
which shows the melting histogram for the human ribosomal
protein L11 cDNA sequence (a) and for the cDNA in
which the second intron has been inserted (b). In the cDNA
the strongest peaks in the histogram are correlated with
exon-exon boundaries: this holds for the boundaries 1, 2
and 5 (notice that the thermodynamic boundary 2 is shifted
by 15 base pairs compared to the exon-exon boundary).
There are also some weaker signals close to the boundaries
3 and 4 (such weak signals have not been taken
into account in the histogram of Fig. 7). When the intron
is inserted [Fig. 6(b)] many other "spurious" peaks
appear. Exon-intron boundaries may therefore be difficult
to detect from thermodynamics, although the corresponding
signals still persist.

FIG. 6: (a) Melting histogram for the cDNA sequence encoding
the ribosomal protein L11 (RPL11), NCBI entry
NM_000975. The numbers above the exon-exon boundaries
denote the length of introns in the genomic DNA. (b) The
same sequence as in (a) to which intron 2 of 1047 bp is
added (in color).

FIG. 7: Plots of $N_a(x)$, the number of observed thermodynamic
boundaries in the interval $[x - a/2, x + a/2]$. (a) Set of 48
human cDNAs selected at random from the RefSeq GenBank
set [20] and (b) set of 35 human cDNAs from housekeeping
genes taken from [21]. The dashed lines correspond to a
random distribution of uncorrelated exon-exon and
thermodynamic boundaries.

FIG. 8: Schematic view of a possible pathway of intron
insertion in the genome: introns seem to have preferentially
targeted thermodynamic boundaries.

We have analyzed 48 genes for which exon-intron
boundaries have been experimentally confirmed, selected
randomly from the GenBank RefSeq set [20], and 35
housekeeping (HK) genes, taken from Ref. [21] (HK
genes are expressed in virtually all tissues). For each
sequence we have produced a histogram of thermodynamic
boundaries like those of Figs. 5 and 6. We recorded
the position of the major peaks and calculated their scaled
distance from exon-exon boundaries. As an example, if a
thermodynamic boundary is found at position $i_{th}$ and it
is contained in an exon beginning at $i_1$ and ending at $i_2$,
the scaled distance is defined as $x = (i_{th} - i_1)/(i_2 - i_1)$,
thus $0 \le x \le 1$. Figure 7 shows a plot of the function
$N_a(x)$, defined as the number of observed thermodynamic
boundaries in the interval $[x - a/2, x + a/2]$.
For reference, this result is compared to that obtained
from random distributions of uncorrelated boundaries,
which is a constant (dashed lines in Fig. 7). The plot
clearly demonstrates the significance of the observed
correlations. As is clear from Figs. 2-5, not all exon-exon
boundaries are "detected" by thermodynamics. Our
statistical analysis indicates that the detection score is
roughly 35%.
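
The scaled distance and the counting function $N_a(x)$ used in this analysis can be written down as a short Python sketch. The function names, the window handling at the edges of $[0, 1]$ and the sample values are assumptions made for illustration; they are not taken from the analysis itself. For $n$ boundaries placed uniformly at random, the expected count in a window of width $a$ lying fully inside $[0, 1]$ is simply $n a$, which is the constant reference level mentioned in the text.

```python
import numpy as np

def scaled_distance(i_th, exon_start, exon_end):
    """x = (i_th - i_1) / (i_2 - i_1) for a boundary inside one exon."""
    return (i_th - exon_start) / (exon_end - exon_start)

def count_in_window(x_values, x, a):
    """N_a(x): number of scaled distances in [x - a/2, x + a/2]."""
    x_values = np.asarray(x_values, dtype=float)
    return int(np.sum((x_values >= x - a / 2) & (x_values <= x + a / 2)))

def uniform_expectation(n, a):
    """Expected N_a(x) for n uniformly distributed boundaries, for a
    window of width a that lies entirely inside [0, 1]."""
    return n * a

# Hypothetical example: an exon spanning positions 100 to 300 with a
# thermodynamic boundary found at position 110.
x_example = scaled_distance(110, 100, 300)

# Hypothetical set of scaled distances clustered near the exon ends.
x_values = [0.02, 0.05, 0.50, 0.97, 0.99]
near_start = count_in_window(x_values, 0.0, 0.2)
mid_exon = count_in_window(x_values, 0.5, 0.2)
print(x_example, near_start, mid_exon, uniform_expectation(len(x_values), 0.2))
```

An excess of counts near $x = 0$ and $x = 1$ relative to the uniform expectation is what signals a correlation between thermodynamic and exon-exon boundaries.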

In our view, the correspondence between exon-exon
and thermodynamic boundaries suggests a possible pathway
of intron insertion in the genome during evolution.
Introns seem to have targeted exposed bases at
a "fork" between a double helix and an open DNA region,
as schematically shown in Fig. 8. The exposed
bases could have provided sites where binding with
"foreign" intron sequences was possible, probably because
the two strands are still sufficiently close to each other
to integrate the intron efficiently. To our knowledge,
the idea that the thermodynamic stability of
double-stranded DNA played a role in intron insertion
in eukaryotic genomes has not been considered in other
studies. The emphasis so far has been put on the specific
sequence composition of the few base pairs around the
insertion site; for instance it has been shown that
the upstream exon tends to end with (C/A)AG and the
downstream exon tends to start with GT, which was
interpreted as sequence-specific targeting by some still
uncharacterized intron insertion machinery. As the precise
biochemical mechanism of intron insertion has
not yet been understood, both hypotheses, insertion
at specific sequence sites and insertion at thermodynamic
boundaries, remain plausible. Since only about 1/3 of the
exon-exon boundaries are "detected" by our thermodynamic
analysis, it is also possible that other mechanisms of intron
insertion have been used during evolution as well.
A systematic study of the relationship between thermodynamic
and exon-exon boundaries, combined with a phylogenetic
analysis for different species in the eukaryotic kingdom,
may provide further insights into these issues.

Finally, the boundaries separating regions of DNA
with different stability properties, found in the melting
analysis, should manifest themselves also under
non-equilibrium conditions. Mechanical unzipping of DNA,
for instance, is known to generate metastable forks [22].
Moreover, the thermodynamic boundaries obtained from
statistical mechanics approaches are robust with respect
to "realistic" changes in the stacking energies and
entropic parameters. These modifications do not
noticeably affect the location of the thermodynamic
boundaries. Likewise, as we explicitly verified, a small
percentage of mutations does not have a strong influence
on the thermodynamic boundaries. As the coding parts
of the genes tend to be highly conserved across distant
species, we expect that the melting features of cDNA
sequences are very similar to those of the old "protogenes"
as they were before the intron insertions.